{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chat with Code - RAG System with Codex Validation\n",
    "\n",
    "This notebook demonstrates a Retrieval-Augmented Generation (RAG) system that allows you to chat with code repositories. The system uses LlamaIndex for orchestration and Milvus for vector search, combined with Cleanlab Codex for response validation.\n",
    "\n",
    "## Features\n",
    "- Clone and parse GitHub repositories\n",
    "- Support for multiple file types (Python, JavaScript, TypeScript, Markdown, Jupyter notebooks)\n",
    "- Vector-based similarity search using Milvus\n",
    "- Custom prompt templates for better responses\n",
    "- Response validation using Cleanlab Codex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 📦 Dependencies and Imports\n",
    "\n",
    "Setting up all required libraries for the RAG system:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re\n",
    "import glob\n",
    "import subprocess\n",
    "import nest_asyncio\n",
    "from dotenv import load_dotenv\n",
    "from IPython.display import Markdown, display\n",
    "\n",
    "from llama_index.core import Settings\n",
    "from llama_index.llms.openrouter import OpenRouter\n",
    "from llama_index.core import PromptTemplate\n",
    "from llama_index.core import SimpleDirectoryReader\n",
    "from llama_index.core import VectorStoreIndex\n",
    "from llama_index.core.storage.storage_context import StorageContext\n",
    "from llama_index.core.node_parser import CodeSplitter, MarkdownNodeParser\n",
    "\n",
    "from llama_index.core.indices.vector_store.base import VectorStoreIndex\n",
    "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
    "from llama_index.vector_stores.milvus import MilvusVectorStore"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🔧 Codex Client Setup\n",
    "\n",
    "Initialize Cleanlab Codex for response validation and quality assurance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cleanlab_codex.project import Project\n",
    "from cleanlab_codex.client import Client\n",
    "\n",
    "# Set your Codex API key (from https://codex.cleanlab.ai/account)\n",
    "os.environ[\"CODEX_API_KEY\"] = \"<your_codex_api_key_here>\"\n",
    "\n",
    "# Initialize Codex client and project\n",
    "codex_client = Client()\n",
    "project = codex_client.create_project(name=\"Chat-with-Code\", description=\"Code RAG project with added validation of Codex\")\n",
    "access_key = project.create_access_key(\"test-access-key\")\n",
    "project = Project.from_access_key(access_key)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ⚙️ Configuration Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Allows nested access to the event loop\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🤖 LLM and Embedding Model Configuration\n",
    "\n",
    "Setting up OpenRouter LLM and HuggingFace embedding model for the RAG pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e4910b8ae3e44ad2a0ab4554db2c18d3",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2fb6a9eac3854330a361e60227ce6fd4",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6ad3c56ab0644c8dac1e2ee54017a91d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "README.md: 0.00B [00:00, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fc3e1f40505b419cbb94dbad08df4d75",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e77c5fb1fcde426d8ab58b58f2878df5",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/777 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2e26f0cd108a4f6ea8218ecaafacd034",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9fcc9d43c120464eb6ac590e0ecc287a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e951f6b370a546ad974d5792425f8509",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.txt: 0.00B [00:00, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f6cfbc9424bf4f648eda54f402a6d3e6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json: 0.00B [00:00, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "983efcb64d14473b95305f09ac754892",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6f9dbd11585e43b9a789326bf6492018",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Setting up the LLM\n",
    "llm = OpenRouter(api_key=\"<your_openrouter_api_key_here>\", model=\"qwen/qwen3-coder:free\")\n",
    "Settings.llm = llm\n",
    "\n",
    "# Setting up the embedding model\n",
    "Settings.embed_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-base-en-v1.5\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🛠️ Utility Functions\n",
    "\n",
    "Core functions for repository handling, document parsing, and index creation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def parse_github_url(url):\n",
    "    pattern = r\"https://github\\.com/([^/]+)/([^/]+)\"\n",
    "    match = re.match(pattern, url)\n",
    "    return match.groups() if match else (None, None)\n",
    "\n",
    "def clone_github_repo(repo_url):    \n",
    "    try:\n",
    "        print('Cloning the repo ...')\n",
    "        result = subprocess.run([\"git\", \"clone\", repo_url], check=True, text=True, capture_output=True)\n",
    "    except subprocess.CalledProcessError as e:\n",
    "        print(f\"Failed to clone repository: {e}\")\n",
    "        return None\n",
    "\n",
    "def validate_owner_repo(owner, repo):\n",
    "    return bool(owner) and bool(repo)\n",
    "\n",
    "def parse_docs_by_file_types(ext, language, input_dir_path):\n",
    "    try:\n",
    "        files = glob.glob(f\"{input_dir_path}/**/*{ext}\", recursive=True)\n",
    "        \n",
    "        if len(files) > 0:\n",
    "            loader = SimpleDirectoryReader(\n",
    "                input_dir=input_dir_path, required_exts=[ext], recursive=True\n",
    "            )\n",
    "            docs = loader.load_data()\n",
    "\n",
    "            parser = (\n",
    "                MarkdownNodeParser()\n",
    "                if ext == \".md\"\n",
    "                else CodeSplitter.from_defaults(language=language)\n",
    "            )\n",
    "            return parser.get_nodes_from_documents(docs)\n",
    "        else:\n",
    "            return []\n",
    "    except Exception as e:\n",
    "        print(f'Exception {e} occurred while parsing docs into nodes of file type {ext}')\n",
    "        return []\n",
    "\n",
    "def create_index(nodes):\n",
    "    vector_store = MilvusVectorStore(uri=\"http://localhost:19530\", dim=768, overwrite=True)\n",
    "    storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "    index = VectorStoreIndex(\n",
    "        nodes,\n",
    "        storage_context=storage_context,\n",
    "    )\n",
    "    return index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🔍 Query Engine Setup\n",
    "\n",
    "Main function to set up the complete RAG pipeline for a given GitHub repository."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def setup_query_engine(github_url):\n",
    "    owner, repo = parse_github_url(github_url)\n",
    "    \n",
    "    if validate_owner_repo(owner, repo):\n",
    "        # Clone the GitHub repo & save it in a directory\n",
    "        # input_dir_path = f\"/teamspace/studios/this_studio/{repo}\"\n",
    "        input_dir_path = os.path.join(os.getcwd(), repo)\n",
    "\n",
    "        if os.path.exists(input_dir_path):\n",
    "            pass\n",
    "        else:\n",
    "            clone_github_repo(github_url)\n",
    "\n",
    "        try:\n",
    "            file_types = {\n",
    "                \".md\": \"markdown\",\n",
    "                \".py\": \"python\",\n",
    "                \".ipynb\": \"python\",\n",
    "                \".js\": \"javascript\",\n",
    "                \".ts\": \"typescript\"\n",
    "            }\n",
    "\n",
    "            nodes = []\n",
    "            for ext, language in file_types.items():\n",
    "                nodes += parse_docs_by_file_types(ext, language, input_dir_path)\n",
    "\n",
    "            # ====== Create vector store index ======\n",
    "            try:\n",
    "                index = create_index(nodes)\n",
    "            except:\n",
    "                index = VectorStoreIndex(nodes=nodes, show_progress=True)\n",
    "\n",
    "            # TODO try async index creation for faster emebdding generation & persist it to memory!\n",
    "            # index = VectorStoreIndex(docs, use_async=True)\n",
    "\n",
    "            # ====== Setup a query engine ======\n",
    "            query_engine = index.as_query_engine(similarity_top_k=4)\n",
    "            \n",
    "            # ====== Customise prompt template ======\n",
    "            qa_prompt_tmpl_str = (\n",
    "                \"Context information is below.\\n\"\n",
    "                \"---------------------\\n\"\n",
    "                \"{context_str}\\n\"\n",
    "                \"---------------------\\n\"\n",
    "                \"Given the context information above, I want you to think step by step to answer the query in a crisp manner. \"\n",
    "                \"First, carefully check if the answer can be found in the provided context. \"\n",
    "                \"If the answer is available in the context, use that information to respond. \"\n",
    "                \"If the answer is not available in the context or the context is insufficient, \"\n",
    "                \"you may use your own knowledge to provide a helpful response. \"\n",
    "                \"Only say 'I don't know!' if you cannot answer the question using either the context or your general knowledge.\\n\"\n",
    "                \"Query: {query_str}\\n\"\n",
    "                \"Answer: \"\n",
    "            )\n",
    "\n",
    "            qa_prompt_tmpl = PromptTemplate(qa_prompt_tmpl_str)\n",
    "\n",
    "            query_engine.update_prompts(\n",
    "                {\"response_synthesizer:text_qa_template\": qa_prompt_tmpl}\n",
    "            )\n",
    "\n",
    "            if nodes:\n",
    "                print(\"Data loaded successfully!!\")\n",
    "                print(\"Ready to chat!!\")\n",
    "            else:\n",
    "                print(\"No data found, check if the repository is not empty!\")\n",
    "            \n",
    "            return query_engine\n",
    "\n",
    "        except Exception as e:\n",
    "            print(f\"An error occurred: {e}\")\n",
    "    else:\n",
    "        print('Invalid github repo, try again!')\n",
    "        return None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🚀 Usage Example\n",
    "\n",
    "Let's test the system with a sample repository."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning the repo ...\n",
      "Data loaded successfully!!\n",
      "Ready to chat!!\n"
     ]
    }
   ],
   "source": [
    "# Provide url to the repository you want to chat with\n",
    "github_url = \"https://github.com/sitamgithub-MSIT/ClassyText\"\n",
    "\n",
    "query_engine = setup_query_engine(github_url=github_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 💬 Basic Query Test\n",
    "\n",
    "Testing the query engine with a simple question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "The name of the Zero-shot Text Classification model used in this project is **ModernBERT-large-zeroshot-v2.0**."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = query_engine.query(\"What is the name of the Zero-shot Text Classification model used in this project?\")\n",
    "display(Markdown(str(response)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ✅ Codex-Enhanced Query System\n",
    "\n",
    "Enhanced query function that includes Cleanlab Codex validation for improved response quality and reliability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fallback_response = \"I'm sorry, I couldn't find an answer for that — can I help with something else?\"\n",
    "\n",
    "\n",
    "def codex_validated_query(query_engine, user_query):\n",
    "    # Step 1: Get response from your RAG pipeline\n",
    "    response_obj = query_engine.query(user_query)\n",
    "    initial_response = str(response_obj)\n",
    "\n",
    "    # Step 2: Convert to message format\n",
    "    context = response_obj.source_nodes\n",
    "    context_str = \"\\n\".join([n.node.text for n in context])\n",
    "\n",
    "    prompt_template = (\n",
    "        \"Context information is below.\\n\"\n",
    "        \"---------------------\\n\"\n",
    "        \"{context}\\n\"\n",
    "        \"---------------------\\n\"\n",
    "        \"Given the context information above, I want you to think step by step to answer the query in a crisp manner. \"\n",
    "        \"First, carefully check if the answer can be found in the provided context. \"\n",
    "        \"If the answer is available in the context, use that information to respond. \"\n",
    "        \"If the answer is not available in the context or the context is insufficient, \"\n",
    "        \"you may use your own knowledge to provide a helpful response. \"\n",
    "        \"Only say 'I don't know!' if you cannot answer the question using either the context or your general knowledge.\\n\"\n",
    "        \"Query: {query}\\n\"\n",
    "        \"Answer: \"\n",
    "    )\n",
    "    user_prompt = prompt_template.format(context=context_str, query=user_query)\n",
    "    messages = [{\n",
    "        \"role\": \"user\",\n",
    "        \"content\": user_prompt,\n",
    "    }]\n",
    "\n",
    "    # Step 3: Validate with Codex\n",
    "    result = project.validate(\n",
    "        messages=messages,\n",
    "        query=user_query,\n",
    "        context=context_str,\n",
    "        response=initial_response,\n",
    "    )\n",
    "\n",
    "    # Step 4: Return Codex-evaluated final response\n",
    "    final_response = (\n",
    "        result.expert_answer\n",
    "        if result.expert_answer and result.escalated_to_sme\n",
    "        else fallback_response if result.should_guardrail\n",
    "        else initial_response\n",
    "    )\n",
    "\n",
    "    # Step 5: Return both final response and full validation info\n",
    "    return {\n",
    "        \"final_response\": final_response,\n",
    "        \"validation_results\": result.model_dump()\n",
    "    }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧪 Testing Codex-Validated Responses\n",
    "\n",
    "Compare the validated response with detailed validation metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Final Answer:\n",
      " The name of the Zero-shot Text Classification model used in this project is **ModernBERT-large-zeroshot-v2.0**.\n",
      "\n",
      "Validation Results:\n",
      "  deterministic_guardrails_results: {}\n",
      "  escalated_to_sme: False\n",
      "  eval_scores: {'trustworthiness': {'score': 0.99999998338089, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}, 'context_sufficiency': {'score': 0.99751243781125, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}, 'response_helpfulness': {'score': 0.9975124377834605, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}, 'query_ease': {'score': 0.7938874203515002, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}, 'response_groundedness': {'score': 0.9975124378111279, 'triggered': False, 'triggered_escalation': False, 'triggered_guardrail': False, 'failed': False, 'log': None}}\n",
      "  expert_answer: None\n",
      "  is_bad_response: False\n",
      "  should_guardrail: False\n"
     ]
    }
   ],
   "source": [
    "output = codex_validated_query(query_engine, \"What is the name of the Zero-shot Text Classification model used in this project?\")\n",
    "\n",
    "print(\"Final Answer:\\n\", output[\"final_response\"])\n",
    "print(\"\\nValidation Results:\")\n",
    "for k, v in output[\"validation_results\"].items():\n",
    "    print(f\"  {k}: {v}\")"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
